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This paper introduces a new approach to analysing spatial point 
data clustered along or around a system of curves or 'fibres'. Such 
data arise in catalogues of galaxy locations, recorded locations of 
earthquakes, aerial images of minefields, and pore patterns on finger- 
prints. Finding the underlying curvilinear structure of these point- 
pattern data sets may not only facilitate a better understanding of 
how they arise but also aid reconstruction of missing data. We base 
the space of fibres on the set of integral lines of an orientation field. 
Using an empirical Bayes approach we estimate the field of orien- 
tations from anisotropic features of the data. We then sample from 
the posterior distribution of fibres, exploring models with different 
numbers of clusters, fitting fibres to the clusters as we proceed. The 
Bayesian approach permits inference on various properties of the clus- 
ters and associated fibres, and the results perform well on a number 
of very different curvilinear structures. 

1. Introduction. In this paper we introduce a new empirical Bayes ap- 
proach concerning point processes that are clustered along curves or 'fibres', 
with additional background noise. 

In nature such point patterns often arise when events occur near some latent 
curvilinear generating feature. For example earthquakes arise around seismic 
faults which lie on the boundaries of tectonic plates and hence are naturally 
curvilinear. Similarly, sweat pores in fingerprints lie on the ridges of the 
finger, which possess a curvilinear structure. Figure 1 presents these data 
together with two simulated examples of point patterns clustered around 
underlying families of curves with additional background noise. Identifica- 
tion of curvilinear elements and elucidation of their relationship with the 
point data is both an interesting theoretical problem and a useful tool for 
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gaining understanding of the origins of the data. It also provides a technique 
for reconstruction of missing data. 

The model introduced here describes families of non-intersecting curves via 
a field of orientations (a map vpo : — )• [0, vr) assigning an undirected ori- 
entation to each point in the window). The curves are produced as segments 
of streamlines integrating the field of orientations. We say that a curve inte- 
grates the field of orientations if the curve is continuously differentiable and 
of unit speed, and if its tangent agrees with the field of orientations at each 
point. The term streamline is used to describe a curve which integrates 
the field of orientations and has no end points in the interior of the window 
W\dW. 

We choose to use a variant on an empirical Bayes approach to estimate the 
field of orientations, since a fully Bayesian approach would involve infinite 
dimensional distributions and be very computationally intensive. The empir- 
ical Bayes component consists of estimation of the field of orientations from 
the data via a tensor field as detailed in Section 4.1. In the following, a ten- 
sor field is represented by assignation of a symmetric non-negative definite 
matrix to each point of the planar window. Tensor fields of this kind play 
an important role in Diffusion Tensor Imaging (DTI, see Le Bihan et al., 
2001). The field of orientations is constructed simply by calculating the ori- 
entations of the representative matrices' principal eigenvectors; singularities 
in the field of orientations correspond to points where there is equality of 
the two eigenvectors. 

We show how properties of the underlying distribution of fibres can be esti- 
mated using Monte Carlo techniques applied to the spatial point data. Our 
approach has the advantage that it can be used to quantify uncertainty on 
a range of parameters and does so effectively for different types of curvilin- 
ear structure. The use of a field of orientations to identify fibres leads to a 
strong performance on data such as that shown in Figure 1(d), where there 
is noticeable alignment of points perpendicular to the fibres. 

1.1. Potential Applications. Point patterns with a latent curvilinear struc- 
ture arise in many different areas of study. 

In seismology, epicentres of earthquakes are typically densely clustered around 
seismic faults. The earthquake data from the New Madrid region as shown 
in Figure 1(c) consists of one short dense cluster of points, one longer rather 
sparse cluster, both connected, and a relatively small number of 'noise' 
points scattered over the window. The New Madrid earthquake data is con- 
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(a) Simulated point pattern. 
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(b) Simulated point pattern de- 
scribed in Stanford and Raftery 
(2000). 
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(c) Earthquake epicentres in the New 
Madrid region. Data is taken from 
CERI (Center for Earthquake Re- 
search and Information). 
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(d) Pores along ridges of a portion 
of the fingerprint a002-05 from the 
NIST Special Database 30 Watson 
(2001). 



Fig 1 . Four examples of pomt patterns clustered around latent curvilinear features with 
background noise. 
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sidered in Stanford and Raftery (2000) 's approach to detecting curvilinear 
features. 

In cosmology, galaxies appear to cluster along inter-connected filaments 
forming a 3-dimensional web-like structure with large voids between the 
filaments. There is interest in identifying the nature of the filaments (see 
for example Stoica, Martinez and Saar, 2007). There is also evidence that 
these galaxies form surfaces or 'walls' in some regions. This suggests the 
exciting challenge of extending our model to include 2-dimensional surfaces 
in 3-dimensional space. 

A further application is that of sweat pore patterns on fingerprint ridges (see 
Figure 1(d)). Sweat pores are tiny apertures along the ridges where the ducts 
of the sweat glands open. Robust inference of the ridge structure from the 
pore pattern has potential for aiding reconstruction of patchy fingerprints 
and may also allow for more efficient storage of fingerprints in very large 
databases. The underlying curve structure is a dense set of locally parallel 
curves along which pores are located, usually very close to the centre of the 
ridges. The noise arises mostly from artifacts in the automatic extraction of 
pores from the image. 

An issue with fingerprint pore data is that pores usually align across the 
ridges as well as along them. This can complicate the reconstruction of 
ridges as the dominant orientation is less clear. We overcome this issue by 
constructing a smooth tensor field which extrapolates dominant orientation 
estimates over the regions of directional ambiguity. 

1.2. Background. An existing method of estimating the curves in the un- 
derlying structure of a point process is Stanford and Raftery (2000) 's use 
of principal curves (a nonlinear generalization of the first principal compo- 
nent line). An EM-algorithm is used to optimize the model for a number of 
combinations of smoothness and number of components. An optimal choice 
of smoothness and number of components is then selected using Bayes fac- 
tors. This technique generally performs very well; however it is sensitive to 
the initial clustering of the data and therefore has difficulties reconstructing 
fibres in some regions where fibres may be expected but signal points are 
absent (for example the fingerprint pore data - Figure 1(d)). 

A piecewise linear 'Candy Model' (or 'Bisous Model' in three dimensions) 
is used by Stoica, Martinez and Saar (2007) to model filaments in galaxy 
data. They compare the empirical densities of galaxies within concentric 
cylinders and thus delineate these filaments. This approach is restricted 
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to piecewise linear fibre models where the deviation of points from fibres 
follows a uniform distribution over a thin cylinder centred along the fibre. 
Sufficient statistics of the model for data with filamentary structure are then 
compared to sufficient statistics on structureless data sets, see Stoica et al. 
(2005); Stoica, Martinez and Saar (2007) and Stoica, Martinez and Saar 
(2010). 

Density estimates of the point pattern can be obtained using techniques such 
as kernel smoothing. Fibres can be directly estimated from this density; an 
example of this can be seen in Genovese et al. (2009) where steepest ascent 
paths along the density estimate are constructed and the density of these 
paths is analysed. 

A further approach discussed in Barrow, Bhavsar and Sonoda (1985) is based 
on construction of the minimal spanning tree of the set of points. In three 
dimensions this gives a useful insight into the overall characteristics of the 
filamentary structure. 

The method presented in August and Zucker (2003) is based on a random 
curve model in which curvature is defined as a Brownian motion. The re- 
sulting model is used to enhance contours in the output of edge operators 
applied to digital images and thus to data in which signal points are dense 
along curvilinear structures. 

The treatment advocated here is also based on the formulation of a general 
model for families of curves and the point patterns clustered around them. 
In contrast to August and Zucker (2003), curves are modelled as segments 
of streamlines integrating a smooth field of orientations which encourages 
interpolations over areas of missing data. The prior model for the orientation 
field is derived via an empirical Bayes step. Then Birth-Death MCMC is 
used to sample from the posterior distribution of the fibres. The model 
formulation itself uses the initial exploratory work of Su et al. (2008) (see 
also Su, 2009), which focussed on the fingerprint pore data and described 
the use of tensor fields for estimating dominant orientations in spatial point 
data. 

1.3. Problem Definition. In particular we are interested in modelling a ran- 
dom point process 11 viewed in a planar window W C 'M?; we write the ob- 
served part of the point process as W CiH = {yi, ...,ym} for some arbitrary 
ordering of points. The point process arises from a mixture of homogeneous 
background noise and an unknown number of point clusters, each clustered 
along a curve, henceforth called a fibre. Thus a fibre is a one-dimensional 
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object, a smooth curved segment, embedded in a higher dimensional space 
(the space containing the point process). Random sets of fibres or 'fibre pro- 
cesses' are discussed in Stoyan, KendaU and Mecke (1995) and Ilhan et al. 
(2008). 

Having specified an appropriate model we must identify a suitable method 
of analysis of the posterior distribution of fibres given a data set of spatial 
point locations {yi, ...,ym} over the window W. 

1.4. Plan of Paper. The paper is laid out as follows. The following section 
gives an overview of the model proposed in this paper. Details of the under- 
lying probability model are given in Section 3. The empirical Bayes method 
of estimating an appropriate field of orientations is outlined in Section 4. 
Section 5 presents a method of sampling from the posterior distribution of 
the fibres given the point pattern data using Monte Carlo methods. This is 
implemented for a number of examples in Section 6. In the final section we 
compare this model to other approaches, discuss some known issues of im- 
plementation and statistical analysis, and note possible directions in which 
this model might be extended. 

2. Basic Considerations. We use a Bayesian hierarchical model to de- 
scribe the relationship between the points and the fibres. 

2.1. Points. A natural choice is to model the spatial point process as a 
mixed Poisson process or 'Cox process' driven by a random fibre process. 
By this we mean that the points arise independently and are associated in 
some way with a random fibre - typically clustered around it. Such a point 
process is called a 'fibre-process generated Cox process', see Illian et al. 
(2008). 

In our model we do associate points with particular fibres but we remove 
the Poissonian character of the distribution of points along fibres, replacing 
this by a renewal process based on Gamma distributions for interpoint dis- 
tances. This allows us to model a tendency to regularity in the way in which 
points are distributed along a fibre but includes the Poissonian case with 
Exponentially distributed interpoint distances. 

2.2. Fibres. In contrast to previous work, in which curves are often con- 
structed as splines fitted to the data, we define them as integral curves of a 
field of orientations. This means that at any point on a fibre, the tangent to 
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the fibre agrees with the field of orientations at that point. Note that a field 
of orientations is equivalent to a vector field except that each point in the 
field is assigned a directionless orientation. An instance of a random field of 
orientations Tpo is written as vpo : W ^ [0, vr) where [0, vr) represents the 
space of planar directions (with and vr identified). 

The simplest way to determine a fibre F is to choose a reference point 
u G W on the fibre and specify the two arc lengths li,l2 G of F\{a;}, 
see Figure 2. For a fixed field of orientations this will characterize a fi- 
bre, although the parametrization by reference point and length is evidently 
not unique. We model the fibres in terms of these parameters (the refer- 
ence points, arc lengths and field of orientations). Note that an alternative 
construction can be based on random selection of a finite number of fi- 
bres generated by decomposing the streamlines according to Poisson point 
processes distributed along the streamlines; however this construction in- 
troduces intriguing measure-theoretic issues which are out of place in the 
present treatment. 



Fig 2. An orientation field is depicted as thin grey lines. A fibre F{uj,li,l2) is defined 
as the curve segment that integrates the orientation field from reference point uj £ W in 
one direction to a distance li and in the other direction to a distance Recall, a curve 
segment is said to integrate the orientation field if at any point of the segment its tangent 
agrees with the orientation field. 

We note that taking the reference points to be uniformly distributed over 
the window W will lead to a bias in the distribution of fibres in that the 
mean length of fibre per unit area is not constant across W. This issue has 
been considered and a solution involving an adjustment to the distribution of 
reference points has been identified. We have not applied the bias correction 
to our examples as there is sufficient data to make the bias negligible. 
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The field of orientations is a useful intermediary in constructing fibres, and 
as such is part of a useful decomposition of the construction problem. In 
practice we seek to identify a suitable field of orientations by analysis of 
properties of the data. 

2.3. Noise. Finally we include background noise in the form of an indepen- 
dent homogeneous Poisson process superimposed onto the fibre-generated 
'signal' point process. 

3. Probability Model. A Directed Acyclic Graph (or DAG) showing 
the conditional dependencies for the model is shown in Figure 3. A good 
introduction to Directed Acyclic Graphs is given by Pearl (1988). 

3.1. Fibres. Henceforth let F = {Fi, ...,Fk} denote k random fibres. As 
outlined earlier and illustrated in Figure 2, the fibre Fj is determined by a ref- 
erence point Uj and arc lengths Ij^i, lj^2- It is also written Fj = Fj{u}j, Ij, vfo) 
(where Ij = lj,2)) to indicate that it is a deterministic function of Uj and 
Ij once i^Fo is given. For the list of reference points we write u = {wi, w^}, 
and the arc length vectors are given by 1 = {^i, .., l^}- We use Ij^T = + lj,2 
as a shorthand for the total arc length of the jth fibre. Note that in general 
the orientation field upo niay possess singularities, which would constrain 
the choice of the lengths Ij = (/j,i,/j,2); however this does not arise in our 
examples. 

3.2. Signal Points. Points from the observed pattern may be either signal 
or noise. Signal points are typically clustered around fibres. The model we 
use assigns an anchor point pi on some fibre to each data point yi. The data 
point is then displaced from pi by an isotropic bivariate normal distribution 
(i.e. yi ~ MVN(pi, cj^jgpl2) where I2 is the 2x2 identity matrix). 

The fibre on which pi is located is determined by an auxiliary variable Xi , so 
Xi = j if and only if pi G Fj. The pj's on the jth fibre are spaced such that 
the vector of arc- length distances between adjacent points is proportional 
to a Dirichlet distributed random variable. Setting an appropriate param- 
eter for the Dirichlet distribution will encourage points to be either evenly 
spread, clustered along the fibre, or placed independently at random along 
the fibre. 

The probability that point yi is allocated to the jth. fibre (Xi = j) is pro- 
portional to the total length of fibre Fj. This ensures that the mean points 
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A" Number of fibres Probability point is signal 

cOj Fibre reference point Zj Signal/noise indicator 

Yj^i, Field of orientations A' Associated fibre 

Fibre lengtii vector Pi Associated point on fibre 

Fibre ^'/^p Dispersion parameter 

m Total number of points 1', Data point 



Fig 3. Directed Acyclic Graph (DAG) of model: arrows indicate conditional dependencies, 
elements in squares are deterministically calculated or constant, whilst those in circles are 
random variables. For simplicity we have not included reference to hyperparameters \, k, 

r] , CYsignal and [^signal- 



per unit streamline remains approximately constant. 

3.3. Noise Points. Noise is then added as a homogeneous Poisson process. 
This is included in the model by first allocating each point yi to noise or sig- 
nal (stored in auxiliary variable Zj = 1 or for signal or noise respectively). 
Point Hi is allocated to signal independently of the allocations of all other 
points. The prior probability that yi is allocated to signal is given by ej. If 
the point is signal then its location is distributed as outlined in the previous 
subsection. Otherwise, if the point is noise, it is distributed uniformly across 
the window W . 
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3.4. Total Number of Points. The total number of points m is assumed to 
be Poisson distributed. The mean total number of points //total is defined 
to be equal to some function of //signal; the mean number of signal points, 
and p, a parameter governing the number of noise points. For the sake of 
simplicity we set p to be the prior assumption on the proportion of the total 
points that are noise points and define /ttotai = /isignai/(l — p)- The mean 
number of signal points /tgignai is assumed to be proportional to the total 
sum of the fibre arc lengths. Hence m is assumed to be Poisson distributed 
with mean 



where p = /3signai/ (^signal + /^signal) IS the prior estimate of the proportion of 
points that are signal and ?] is a density parameter. 

The assumption that the mean number of noise points is proportional to 
the mean total number of points (and the fibre length) is particularly well 
suited to the fingerprint example, see Section 6.3, where noise points arise 
as artifacts of the pore detection process along the fingerprint ridges. Imple- 
mentation of alternative relationships between the mean number of signal 
and noise points would be a straightforward matter. 

3.5. Priors. In the examples given in the next Section 6 we use the following 
priors: 



(1) 




k 



P{l\k,X) 



Yl P(/j,i|A)P(/j- 2|A) where Ij. ~ Exp(l/A) 



k 



P{u\k) 



P{i-^j) where ujj ~ Uniform(14/') 
Poisson(K) 



P{k\K) 



m 





where ~ Beta(asignai, Aignai)- 



Here m is the total number of points in {yi, ...,ym}- 



FIBERS, POINT PROCESSES, FIELDS OF ORIENTATIONS 



11 



The above prior models are common, parsimonious choices that appear flex- 
ible enough for a range of applications including the examples considered 
in Section 6. However, if application-specific prior information suggests al- 
ternative prior models then these can be accommodated in the presented 
framework. 

3.6. Posterior. We are interested in the posterior distribution of fibres (and 
various other parameters) given a particular instance of the point process. 
This posterior is given by 

(2) 7r(F,l,cj,fc,t;FO,e,Z,X,p) 

= P(F,l,a;,A;,t;Fo,e,Z,X,p|y) 
oc P(F,l,a;,A;,t;FO,e,Z,X,p) 

xL(F, 1, w, k, wpo, e, Z, X, p|y) 
= P{l\k)P{u\k)P{k)P{vFo)P{e)P{Z\e) 

xP(X|Z, l)P(p|F, X)L(F, 1, u, k, VFo, e, Z, X, p|y). 

Here P(-) indicates a prior distribution. We omit P(F|li, I2, a;, upo) as it is 
deterministically calculated. 

Section 5 describes how to sample from this posterior distribution using 
Markov Chain Monte Carlo techniques. 

3.7. Computational Simplifications. Computer implementation makes it nec- 
essary to represent the field of orientations by a discrete structure. We adopt 
the simple approach of estimating the field of orientations at a dense regular 
grid of points over W. Integral curves are calculated stepwise by estimating 
the orientation at a point by its value at the nearest evaluated grid point 
and extending the curve a small distance in that direction. Note that the 
choice of direction (from the two available for each orientation) is made so 
that the angle between adjacent linear segments is greater than 7r/2. 

Consequently fibres are stored as piecewise-linear curves and further calcu- 
lations are performed on these approximations. Of course this discretization 
can be arbitrarily reduced (at a computational cost) to improve the accuracy 
of the approximation. 

4. Construction of Field of Orientations. We must of course iden- 
tify a method for calculating the field of orientations. It is computationally 
advantageous to generate a field of orientations which is likely to contain 
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(be integrated by) fibres that fit the data well (produce a high likelihood). 
The most natural way to do this is to base the calculation of the field of 
orientations on the data, using an empirical Bayes technique. The use of 
empirical Bayes to find the prior for the field of orientations distribution 
means that aspects of the prior, or parameters of the prior, are estimated 
from the data. 

An alternative approach would be to use a fully Bayesian model, where 
we would treat the field of orientations as an independent random variable 
Tpo- We would then need to identify its state-space and a corresponding 
cr-algebra, transition kernel and prior on this state space. These could be 
derived from random field theory (see, for example, Adler and Taylor, 2007), 
using an appropriate covariance function to maintain smoothness in the field 
of orientations, however there are a number of issues with this approach. In 
practice one may expect the task of sampling a random field of orientations 
to be computationally expensive, particularly if the covariance function does 
not have a simple form (as is likely in this model). Calculations relating to the 
conditional distribution of the field of orientations given the fibres are likely 
to lead to unfeasible computational complexity. A further issue is that this 
approach leads to a huge space of possible fibres, resulting in corresponding 
difficulties in ensuring this space is properly explored. Use of the information 
given in the data will help to limit this space to a more easily explorable 
restricted class of suitable fields of orientations. 

Here we use the data to make local orientation estimates and smooth these 
to produce a field of orientations estimator. As we are specifically interested 
in orientation estimates arising from the signal data, we can choose to weight 
the contribution of each point to the field of orientations estimator by how 
likely it is to be noise or signal. 

Estimation of a field of orientations given that yi, ym are all signal points 
is outlined in the following section. Section 4.2 shows how to extend this 
to take account of the information given in the vector of probabilities that 
points are signal (ei, £2, em)■ 
4.1. Estimation for all Signal Points. The mapping and tensor method 
described in Su et al. (2008) (and further discussed in Su, 2009) is applied 
to the point pattern to construct a tensor at each point. To this we apply a 
Gaussian kernel smoothing in the log-Euclidean metric to construct a tensor 
field. The tensor field is represented by an assignation to each point of a 
2x2 non-negative definite matrix whose principal eigenvector indicates the 
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dominant orientation at that point; the relative magnitude of the eigenvalues 
indicates the strength of the dominant orientation. The field of orientations 
assigns the orientation of this principal eigenvector to each respective point. 
If the principal eigenvector is not unique at a certain point (which is to say 
that the eigenvalues are equal there) then that indicates a singularity in the 
field of orientations. 

Three-dimensional tensor fields of this kind are commonly used in diffusion 
tensor imaging (DTI) to understand brain pathologies such as multiple scle- 
rosis, schizophrenia and strokes. DTI is used to analyse images of the brain 
collected from magnetic resonance imaging (MRI) machines. The MRI scan 
detects diffusion of water molecules in the brain and uses the data to infer the 
tissue structure that limits water flow. The 3-dimensional diffusion tensor 
describes the orientation dependence of the diffusion. Roughly-speaking, the 
eigenvalues indicate a measure of the proportion of water molecules flowing 
in the associated eigenvector direction. For more information on DTI see, 
for example, Le Bihan et al. (2001) and Li et al. (2007). 

Let yi,...,ym denote the spatial data points. A tensor is constructed at 
a point Uj using a non-linear transformation applied to the vectors f * = 
{v{,vi) = yft/i fov i ^ j (Su et al., 2008; Su, 2009). Specifically, 

(3) = {v\,v^,) 

where crpo is ^ scaling parameter. 
The tensor at yj is then represented by 

(4) Uy,) = Yli^\,vi,f{v{,v^,). 

The result of the above method is to produce a set of 2 x 2 matrices lo- 
cated over a sparse set of locations. In order to create a field of orientations 
we must then interpolate to get a tensor field. Thus we use the orientation 
of the principal eigenvector, where defined, to construct a field of orienta- 
tions. 

Interpolation of tensors inevitably requires a notion of tensor metric. We 
elect to work in the log-Euclidean metric (see Arsigny et al., 2006). For 
an extended account of tensor metrics see Dryden, Koloydenko and Zhou 
(2009). Log-Euclidean calculations are simply Euclidean calculations on the 
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tensor logarithms which are transformed back to tensor space by taking 
the exponential. The tensors arising in this study can all be represented by 
positive definite matrices. Tensor logarithms are therefore well defined as 
logarithms of these matrices. However, the matrix calculated in Equation 
(4) will have a zero-eigenvalue if the points are collinear, and therefore not 
be positive definite. If all points are truly collinear then our approach breaks 
down - and indeed the method is not intended for such noise- free data sets. 
The more common situation is that one vector dominates the tensor repre- 
sentation as calculated in Equation (4) due to the relative distances between 
points. Typically this occurs if two points are close while other points are 
far from the pair. Due to rounding errors, the contribution of other points 
to the matrix becomes zero, and the two remaining points are collinear by 
definition. In order to avoid an error in the log-Euclidean calculation, if a 
tensor has at least one zero-eigenvalue then it is replaced by the "uninfor- 
mative" identity matrix, suggesting a lack of directional information. Thus 
we take a conservative approach that excludes any potentially misleading 
directional information. 

We calculate the interpolated tensor field T/,pq(x) for (x G W) as a kernel 
smoothing procedure, using a Gaussian kernel / with variance parameter 
hpQ in the log-Euclidean metric. Hence when the smoothing parameter hpo 
is positive, hpo > 0, 



The field of orientations vpoiui-, ■■■^Um] %o)(a^) iov x G W is then defined to 
be equal to tan~^(fi(x)/u2(2;)) where {vi{x),V2{x)) is the principal eigen- 
vector of the matrix representation of Thp^ix). 

In most instances this procedure will give a good estimation of a suitable 
field of orientations for modelling the point process with integral fibres. The 
smoothing method has the drawback that it can create a bias around areas 
of high curvature (rapidly varying orientation) in the field of orientations. 
Potential solutions have been analysed and found to perform well. The mag- 
nitude of the bias was found to be proportional to the smoothing parameter 
hpo, hence these solutions typically involve compensating for h-po: we give 
two examples: 

(A) We allow the smoothing parameter to vary over the window, such that 
lower values are used in areas of high point intensity. This ensures 
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that less information is extrapolated to regions where there is already 
sufficient data, and therefore less bias occurs in those regions. 

(B) We consider two instances of the field of orientations with different 
values of smoothing parameter h'^Q and /ipo' unbiased estimate 
can be found by extrapolating back to estimate the field of orientations 
with hpo = 0- 

We do not go into further detail here because doing so would distract from 
the main ideas in this paper but we do notice a minor effect of this bias in 
the examples in Section 6. Details of these bias corrections can be found in 
Hih (2011). 

4.2. Estimation using Signal Probabilities. We extend this field of orienta- 
tions estimation to take account of the vector of probabilities that points 
are signal {ei,e2, ■■■,£m) by weighting the construction of the initial ten- 
sor and also weighting the contribution of each initial tensor to the kernel 
smoothing. 

Specifically, the initial tensors are represented by 

(6) To{y,) = Y.{^lv^2f{v\,v^2)e^ 

for each point and the tensor field becomes 

Ey,g{yi,...,y^}e»/(dist(3:,j/i))log(ro(j/j)) \ 

This weighting allows points that are more likely to be signal points to have 
a greater effect on the field of orientations estimation. As — )• the effect 
of the point yi on the field of orientations tends to zero, whereas if = 1 for 
all i we would be performing the calculation described in Section 4.1. 

5. Sampling from the Posterior Distribution. We seek to infer some 
characteristics of the fibre process when only the point pattern is known. 
Typical attributes of interest include the number of fibres, where they are 
located/orientated, which points arose from which fibre and which points 
arose from background noise. 

Direct inference from the model is hindered by the complexity of its hi- 
erarchical structure. Hence we choose to draw samples from the posterior 



(7) Th^oi^) 
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distribution of the fibres and other variables using Markov Chain Monte 
Carlo methods. Characteristics of interest can be estimated from these sam- 
ples. 

5.1. Hyperparameters. As a rough guideline, hyperparameters can be cho- 
sen as follows. 

The prior mean number of fibres k and the prior mean length of fibres A can 
be estimated from any prior knowledge or expectations of the fibres. The 
deviation of points from fibres cr^j^p can be estimated using prior knowledge 
of fibre widths and the approximation that 95% of points should lie within 
2.45<Tdisp of the centre of a fibre. The density of points per unit length of 
fibre r] can be similarly estimated. 

Orientation field parameters /ipo and crpo should be chosen to ensure the 
orientation field is smooth. These can be estimated by evaluating the orien- 
tation fields for different selections of /ipO) ^FO and choosing from this set. 
If the proportion of noise points is approximately known then the hyperpa- 
rameters asignai and /3signai Can be suitably estimated, however we suggest 
choosing the parameters such that asignah /^Signal > 1 to ensure good mixing 
properties of the Markov Chain Monte Carlo sampling algorithm. Other- 
wise the noise hyperparameters can be set equal to 1 indicating no prior 
knowledge. 

Alternatively, if little prior information is known about the nature of the 
latent curvilinear structure, then it would be feasible to extend the empirical 
Bayes step to include the estimation of further prior parameters. 

5.2. Birth-Death Monte Carlo. The starting point for our algorithm is a 
continuous time Birth-Death Markov Chain Monte Carlo (BDMCMC) in 
which fibres are created and die at random times controlled by predeter- 
mined or calculated rates. This enables exploration of a wide range of models 
with different numbers of fibres, and is suited to this type of clustered data. 
See M0ller and Waagepetersen (2004) for an introduction to spatial Birth- 
Death processes. For the sake of brevity we present in the following the key 
points of the algorithm. Further details can be found in Hill (2011). 

Here we choose to fix the birth rate and calculate an appropriate death rate 
at each step to maintain detailed balance. 

Following a birth or death we update the following auxiliary variables: Z to 
Z', the indicators of the components (signal/noise) to which the points are 
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associated; X to X', the indicators of the fibres to which the signal points 
are associated; p to p', the vector (pi, ...,Pm) where pi is the point on the 
fibre to which the data point yi is associated. 

5.3. Births of Fibres. Recall that the parameterisation of fibres is described 
in Section 2.2. Birth events occur randomly at rate (3. Upon the occur- 
rence of a birth, the number of fibres is updated from k to k + I, and 
a new fibre is introduced by sampling a reference point uj^+i and lengths 
h+i,i,lk+i,2 from the prior distributions P{u)),P{l) respectively. The new 
fibre -Ffc+i is then calculated by integrating the field of orientations accord- 
ing to these parameters, and the set of fibres F = Fi, ...,Fk is updated to 
F' = Fi, Ffc, Ffc+i. In order to ensure that the distribution of the lengths 
4+1,1, lk+1,2 is independent of the respective directions in which the field of 
orientations is integrated, we choose them to be independently and identi- 
cally distributed. 

Data points which are currently assigned to the noise component are reas- 
signed to noise or signal with proposal probability dependent on the new 
fibre QbirthCZ H> Z'\Z, e, Fk+i). 

Finally new values are proposed for all of X, p according to a proposal den- 
sity Q(X., p I— 7- X', p'). For simplicity, we choose not to sample from the full 
conditional distribution of p and X, but rather from a density proportional 
to the likelihood L(p, X|y). 

In full, the birth density of fibre F^+i including updates of auxiliary variables 
to Z', X', p' is given by 

(8) ^(Ffc+i, ojfc+i, /i^fc+i, /2,fc+i) Z', X', p') = 

m^k+i)P{h,k+i)P{l2,k+i)Qbirth{^ ^ Z'|Z, e, Ffc+i)Q(X, p ^ X', p'). 

5.4. Deaths of Fibres. We must calculate the death rate 6j for each fibre to 
ensure detailed balance holds. Following the death of fibre Fj the variables 
F, u and 1 are updated by omitting the jth term. Further, auxiliary variables 
Z, X and p are all updated. All points allocated to fibre Fj are now allocated 
to noise. We call this trivial proposal density (5deat/i(Z'|Z, X, j). Again the 
final step is to propose new values for all of X and p according to proposal 
density (5(X, p i-^ X', p'). 

Hence the death rate that satisfies detailed balance for the jth fibre is given 
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l2\^2,j|fc - 1) P{u\cOj\k - 1) P{k - 1) 

P(Li,L2|fc) P{u:\k) Pik) 

b{Fj,u}j, hjjhj, X^, p^) 
QdeathiZ.'\Z, X, i)Q(X, p ^ X', pO 
P{k - 1) PQMrthjZ' ^ Z|e, F,)Q(X^ p^ ^ X, p) 

P{k) QdeathC^'lZ, X,i)Q(X, p ^ X', p') 



5.5. Additional Moves. It can be desirable to add extra moves to the BDM- 
CMC process to improve mixing. Some possible moves which were all utilized 
in the examples in Section 6 include 

• Moving a fibre by a small amount (by perturbing the reference point), 

• Resampling the lengths of a fibre (whilst keeping the reference point 
fixed) . 

Each of these events occur at some predefined rate, whence they are pro- 
posed and either accepted or rejected according to the Metropolis Hastings 
probability. 

We may also wish to update other model variables, giving more fiexibility 
and improving the algorithm's exploration of the sample space. The addi- 
tional variable updates used in the examples in Section 6 include 

• Proposing new signal-noise allocations of the data (Z), 

• Proposing new signal probabilities (e) according to non-degenerate 
Beta distributions whose parameters depend on the current signal- 
noise allocations Z. This move leads to an update in the prior for the 
field of orientations due to the empirical Bayes step, hence all fibres 
are resampled. 

Details of all moves can be found in Hill (2011). 

Hyperprior parameters, such as the constant of proportionality r/ in the prior 
for the Poisson-distributed number of points or a^isp governing the deviation 
of points from fibres, may also be updated. We have chosen not to update 
any hyperprior parameters to reduce complexity of the model. 



5.6. Convergence and Output Analysis. First recall that the signal proba- 
bilities e are updated according to non-singular Beta distributions. Hence, 
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the underlying tensor field as defined in (7) will not become degenerate even 
when Z allocates all points to noise. 

Consider the set A of states in which the fibre configuration is empty and 
all points are allocated to noise. In the following discussion we exclude any 
degenerate states of equilibrium probability zero. Inspection of the algorithm 
shows that the set A can be reached from any non-degenerate state in finite 
time and so the Birth-Death process is (/)-irreducible. Recurrence can be 
deduced by noting that the set A is visited infinitely often, see Kaspi and 
Mandelbaum (1994). 

We motivate a heuristic lower bound on a suitable burn-in time by consider- 
ing aspects of the prior derived after inspection of the data (e.g. (Tdisp, A, «; - 
see Section 5.1), and estimating the number of fibre births that must occur 
before a fibre has been created around each potential fibre cluster. We ap- 
proximate the lower bound by considering the number of fibre births required 
for this to happen around the smallest suitable cluster of points. 

A lower bound on half the length of the shortest suitable cluster is derived 
from the 10% quantile of an Exponentially distributed random variable of 
rate k/X. Then the probability that a point chosen at random from W lies 
in a region corresponding to an actual fibre of this length (up to 2(Tdisp from 
the fibre) is approximated by 

.... 8Alog(10/9)adisp 
^ ' k\W\ 

It follows that, with probability 0.99, a fibre will be proposed in the region 
corresponding to the shortest fibre within the first 

(11) - log(O.Ol) 



log ( 1 - 8^1°g(lO/9)<7dis; 



k\W\ 

births. Hence we choose a burn-in time of 

(12) Tburn = max { 1500, r^°|^n/Q^ ^ 

taking 1500 as a lower bound to ensure the burn-in time remains substan- 
tial. 



Convergence was assessed by considering variables such as the number of 
fibres k or the number of noise points and using Geweke's spectral density 
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diagnostic, see Brooks and Roberts (1998). Convergence of a sequence of n 
samples is rejected if the mean value of the variable in the first n/10 samples 
is not sufficiently similar to the mean value over the last n/2 samples. 

We also tested convergence by assessing whether the mean sum of the death 
rates is approximately equal to the birth rate /3. Consider Sl^^^^^t'' where (^totai 
is the sum of the death rates of fibres after the kth event (e.g. birth, death, 
etc) and is the length of algorithmic time before the next event. If the 
MCMC has reached stationarity then 

(13) z„,g^.^?o...'^-"'W^'^ + A,(„,i) 

where c^t^tai* estimate of the standard deviation of i^^.tai^'^' the 

birth rate of fibres, and radd is the sum of the rates of any additional moves 
implemented (as suggested in Section 5.5). We used this result to test the 
convergence of l/mJ2'^=i ^fotai*'' *o /?/ (2/3 + radd)- 

Bearing in mind the complexities of the underlying model, output analysis 
showed no evidence for a lack of convergence. 

Outputs of various variables are recorded at random times at some constant 
rate. The rate of this sampling (effectively the reciprocal of the thinning of 
the Monte Carlo process) is chosen such that there is a low probability that 
any of the fibres remain unchanged between samples. The inclusion of the 
extra moves designed to improve mixing also helps to decrease the thinning 
required. The thinning is chosen approximately proportional to the number 
of fibres (estimated based on aspects of priors derived from inspection of the 
data). 

6. Simulation Studies and Applications. The implemented algorithm 
runs on a continuous time scale. Events occur at a determined rate, either 
fixed or calculated to ensure that detailed balance holds. The units for the 
rate of an event are 'per unit of algorithm time'. The BDMCMC is then 
allowed to run for a large number of time units and samples are taken at 
random times (at some fixed rate). Of course the relationship of algorithm 
time to actual processing time depends on hardware and implementation 
details. Hardware details are described below. 

In each of the following examples, the birth rate, and the rate of other 
moves (moving a fibre, adjusting lengths of a fibre, proposing a split or 
a join, variable updates) were all unit rate. The only exceptions were the 
signal probability (e) updates which were proposed at a rate of 0.1 per unit 
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of time, and the recording of output variables at random times whose rate 
varied for different data sets. 

We evaluate the field of orientations over a square grid of points, each one 
unit length from its four nearest neighbours. The total size of this grid is 
given by the dimensions of the window W. 

All three examples were run on the cluster owned by the Statistics Depart- 
ment in the University of Warwick using a Dell PowerEdge 1950 server with 
a 3.16 GHz Intel Xeon Harpertown (X5460) processor and 16 GB fully- 
buffered RAM. The algorithm was implemented in Octave version 3.2.4.^ 
The total run-times on the cluster ranged from 34.7 hours for the fingerprint 
pore data (32300 units of algorithm time) to 61.3 hours for the earthquake 
data set (30000 units of algorithm time). Due to the limitations of the cur- 
rent version of Octave, the benefits of a parallel implementation have not 
yet been explored. 

Analysis has been performed on all four of the datasets shown in Figure 1. 
However for the sake of brevity we omit discussion of results for the first 
simulated point pattern (Figure 1(a)) from this paper. 

6.1. Simulated Example. Figure 4(a) shows the simulated data set used in 
Stanford and Raftery (2000). We include it here to facilitate comparison 
with the methods proposed by Stanford and Raftery. The data consists of 
200 signal points and 200 noise points over a 200 x 150 window, and is based 
on a family of two fibres each of length 157. 

The Birth-Death MCMC was run for 60000 units of algorithm time, the first 
30000 of which were discarded. Samples were taken at a rate of 0.033 per 
unit of time. The initial state was a randomly sampled set of k = 2 fibres. 
Other hyperparameters were chosen as follows; dispersion parameter (Jciisp — 
3; signal probability hyperparameters asignai = 1 and /3signai = 1; density 
parameter rj = 0.64; mean half-fibre length A = 78.5; and the Dirichlet 
parameter ODir = 1-5. 

Figures 4(b) to 4(d) show that our model fits the data very well, albeit with 
a slight extrapolation of fibres beyond the curves used to generate the data 
set. The two fibres in the sample in Figure 4(b) compare favourably with 
the principal curves fitted in Stanford and Raftery (2000). 



^The Octave code for this algorithm is available at 
URL http : //www2 .Warwick, ac .uk/go/ethoimes/f ibres. 
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♦ l.V • + A' 



50 10D 150 

(a) Simulated data. 




(b) A random sample from the BD- 
MCMC output. Fibres are represented 
by curves, pluses indicate points al- 
located to signal and crosses indicate 
points allocated to noise in this sample. 




(c) Estimate of the density of signal 
points found by smoothing a series of 
samples of fibres (darker areas indicate 
higher densities). Pluses indicate points 
allocated to signal and crosses indicate 
points allocated to noise in at least 50% 
of samples. The size of points represent- 
ing the data has been reduced to enhance 
the clarity of the density estimate. 




(d) Estimate of the clustering of the 
signal points - different symbols indi- 
cate different clusters, crosses indicate 
noise. Estimated by considering how of- 
ten pairs of points are associated with 
the same fibre across a number of sam- 
ples. 



Fig 4. Simulated Example from Stanford and Raftery (2000). 
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Table 1 gives the posterior probabilities of the number of fibres and the 
means and highest posterior density intervals of a variety of properties con- 
ditional on the number of fibres. The number of fibres is simply a count 
of the fibres present in each sample; in this example we expect it to be 
around 2. The number of points assigned to the noise component will typi- 
cally be closely correlated with the number of fibres. With more fibres comes 
a greater chance of there being a fibre close to a given point and hence a 
greater chance that it is a signal point. We take the 95th percentile of the dis- 
tances of signal points from their associated points on fibres for each sample. 
This summarizes the dispersion of points from the fibres. It is comparable to 
2.45(Tdisp) where (Tdisp is the dispersion parameter (set to 3 in this example). 
The constant 2.45 arises as the 95th percentile of the Euclidean distance 
from the origin to a bivariate standard-normal distributed random variable. 
The curvature bias in the field of orientations results in a mild bias on the 
95th percentile of distances from signal points to anchor points. 

In this example, less points are associated to noise than were simulated as 
noise points in the data generation. This is partly due to the high intensity 
of noise points, and also explained by a slight bias in the length of the fi- 
bres. The posterior statistics on the lengths of the fibres suggest that the 
extension of fibres beyond their known length (of 157) is supported by the 
high intensity of noise points. This extrapolation is sometimes beneficial, 
particularly for fibre reconstruction in areas of missing data. Here the ex- 
trapolation is less desirable as it suggests there is evidence for fibres in the 
background noise. 

The extrapolation of fibres into less dense regions of points can be reduced 
by choosing a higher Dirichlet parameter ODir for the distribution of anchor 
points along the fibres. This decreases the posterior density of fibres lying 
through point clusters of non-constant intensity. The drawback of increasing 
ODir is that a large value of aoir leads to a multimodal anchor point distri- 
bution with most of the probability mass concentrated at the modes. As the 
proposal of a birth/change of a fibre does not take account of the shape of 
anchor point distribution, the proposal of a state with low posterior density 
is more likely for larger aDir- 

6.2. Application: Earthquakes on the New Madrid Fault-line. The epicen- 
tres of earthquakes along seismic faults are a good example of point data 
clustered around a system of fibres with additional background noise. Here 
the fibres are the unknown fault-lines. Stanford and Raftery (2000) con- 
sider the structure of the data set of earthquakes around the New Madrid 
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Posterior Probabilities for Number of Fibres 


Number of Fibres 


2 


3 


4 


Posterior Probability 


0.73 


0.23 


0.04 


Other Properties 


Conditioned on the Number of Fibres 




Number of 


Posterior 


50% HPD 


95% HPD 




Fibres 


Mean 


Interval 


Interval 




2 


181.85 


[176,193] 


[156,205] 


Number of Noise Points 


3 


180.81 


[179,197] 


[149,201] 




4 


178.78 


[167,184] 


[155,197] 


95th Percentile of the 


2 


8.16 


[7.17,8.32] 


[6.64,9.66] 


Distances from Signal 


3 


8.06 


[7.28,8.27] 


[6.61,9.53] 


Points to Fibres 


4 


7.95 


[7.20,7.91] 


[6.75,9.50] 




2 


317.68 


[319,325] 


[301,325] 


Total Length of Fibres 


3 


319.60 


[315,322] 


[300,342] 




4 


320.10 


[325,303] 


[303,325] 



Table 1 



Results for Stanford and Raftery's Simulated Example: First sub-table gives posterior 
probabilities on the number of fibres, while the second gives posterior means and 50% and 

95% HPD (highest posterior density) intervals for a selection of properties of the 
posterior distribution conditional on the number of fibres. The simulated data consists of 
200 signal points and 200 noise points over a 200 x 150 window, and is based on a family 
of two fibres each of length 157. The dispersion parameter adisp is set to 3 and the prior 
mean probability that a point is noise is 0.5. Posterior probabilities only given if non-zero 

to rounding error. 

fault line in central USA. We use data on earthquakes in the New IVIadrid 
region between 1st Jan 2006 and 3rd Aug 2008 (inclusive) taken from the 
CERI (Center for Earthquake Research and Information) found at http: 
//www. ceri .memphis . edu/ seismic/ catalogs/ cat_nm.html. 

The Birth-Death MCMC was run for 40000 units of algorithm time, the 
first 10000 of which were discarded. Samples were taken at a rate of 0.0167 
per unit of time. The initial state was a randomly sampled set of k = 4 
fibres. Other hyperparameters were chosen as follows: dispersion parameter 
Cdisp = 2; signal probability hyperparameters Osignai = 4 and /^signal = 1; 
density parameter rj = 1.06; mean half- fibre length A = 30; and the Dirichlet 
parameter aDir = 1-5. 

Table 2 gives some numerical properties of the posterior distribution of fi- 
bres. 

Our method has the advantage over Stanford and Raftery (2000), in that 
it does not try to over-fit the fibres where there is less data. Rather it uses 
information from surrounding data to extrapolate fibres as required. One 
limitation of our model is that every fibre is assumed to share a number 
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(a) Earthquake data (b) A random sample from the BD- 

MCMC output. Fibres are represented 
by curves, pluses indicate points al- 
located to signal and crosses indicate 
points allocated to noise in this sample. 
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(c) Estimate of the density of signal 
points found by smoothing a series of 
samples of fibres (darker areas indicate 
higher densities). Pluses indicate points 
allocated to signal in at least 50% of sam- 
ples. The size of points representing the 
data has been reduced to enhance the 
clarity of the density estimate. 
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(d) Estimate of the clustering of the 
signal points - different symbols indi- 
cate different clusters, crosses indicate 
noise. Estimated by considering how of- 
ten pairs of points are associated with 
the same fibre across a number of sam- 
ples. 



Fig 5. New Madrid Fault Earthquake Data. 
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of properties. In particular the displacement of points from fibres (effec- 
tively the width of influence of a fibre) and the intensity of signal points per 
unit length of fibre are assumed to be constant, independent of the fibre. 
These assumptions are not reasonable for this data as the 'thickness' and 
density of points varies considerably. This is apparent in Figure 5(b) where 
the central dense cluster is described by multiple parallel fibres. The disper- 
sion parameter Udisp was chosen by considering the apparent 'width' of the 
longer thinner fibre, hence points around the shorter, wider fibre effectively 
increase the 95th percentile of the point to fibre distances, as given in Ta- 
ble 2. To overcome this issue one could extend the model to allow different 
hyperparameters for each fibre. 

While multiple fibres in the central cluster is a common feature in samples 
from this BDMCMC, Figure 5(d) indicates that the agglomerative clustering 
algorithm identifies the points as arising from the same cluster. 

Interestingly the total length of fibres does not appear to be positively cor- 
related to the number of fibres, suggesting that the additional fibres arise 
from splitting a fibre into multiple parts while preserving the total fibre 
length. 

6.3. Application: Fingerprint Data. The second application we consider is 
that of pores lying along ridge lines in fingerprints. Fingerprint pore data is 
considered in some depth in Su et al. (2008) and Su (2009). 

We used a portion of the data set extracted from fingerprint a002-05 from 
the NIST (National Institute of Standards and Technology) Special Database 
30 (Watson, 2001). 

The Birth-Death MCMC was run for 40000 units of algorithm time, the first 
8000 of which were discarded. Samples were taken at a rate of 0.007 per unit 
of time. The initial state was a randomly sampled set of k = 10 fibres. 

The fingerprint pore data will typically cause breakdown of nearest neigh- 
bour clustering methods. This is because, whilst the fibrous structure of the 
point pattern is clear when viewing the global picture, it is not so apparent 
on a small scale. This phenomena is partly due to the apparent inter-ridge 
alignment of points (from left to right in Figure 6(a)). By way of contrast, 
our field of orientations model takes any information available on a small 
scale and uses it across the window thanks to the smoothing step in the field 
of orientations. 

As Figure 6 shows, our model succeeds in fitting many of the fibres (or 
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Posterior Probabilities for Number of Fibres 


Number of Fibres 


6 


7 


8 


Posterior Probability 


0.56 


0.36 


0.07 


Other Properties 


Conditioned on the Number of Fibres 




Number of 


Posterior 


50% HPD 


95% HPD 




Fibres 


Mean 


Interval 


Interval 




6 


42.99 


[41,44] 


[38,48] 


Number of Noise Points 


7 


40.65 


[38,41] 


[36,45] 




8 


42.06 


[42,44] 


[35,45] 


95th Percentile of the 


6 


5.25 


[4.94,5.19] 


[4.96,5.60] 


Distances from Signal 


7 


5.29 


[5.12,5.40] 


[4.85,5.89] 


Points to Fibres 


8 


5.17 


[4.94,5.22] 


[4.74,5.65] 




6 


257.24 


[247,257] 


[246,269] 


Total Length of Fibres 


7 


257.43 


[257,262] 


[249,264] 




8 


252.89 


[250,252] 


[248,265] 



Table 2 



Results for Earthquake Data: First sub-table gtves postenor probabilities on the number 
of fibres, while the second gives posterior means and 50% and 95% HPD (highest 
posterior density) intervals for a selection of properties of the posterior distribution 
conditional on the number of fibres. The data are all the recorded earthquakes in the New 
Madrid region between 1st Jan 2006 and 3rd Aug 2008; the data were acquired from the 

CERI (Center for Earthquake Research and Information) found at 
http: // www. ceri . memphis . edu/ seismic/ catalogs/ cat_nm. html . In total there are 
317 points in a 300 x 300 window, the dispersion parameter adisp is set to 2 and the prior 
mean probability that a point is noise is 0.2. Posterior probabilities only given if non-zero 

to rounding error. 
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(a) Pore data (b) A random sample from the BD- 

MCMC output. Fibres are represented 
by curves, pluses indicate points al- 
located to signal and crosses indicate 
points allocated to noise in this sample. 



20 



40 



60 



80 



100 



+ + 



+ + + 



+ + 



4. 

+ + + 



+ 

++ + 



+ 

+ 

+ + 



+ + + + + 



+ + X + >: 



20 



40 



60 



80 



100 



(c) Estimate of the density of signal 
points found by smoothing a series of 
samples of fibres (darker areas indicate 
higher densities). The size of points rep- 
resenting the data has been reduced to 
enhance the clarity of the density esti- 
mate. 




(d) Estimate of the clustering of the 
signal points - different symbols indi- 
cate different clusters, crosses indicate 
noise. Estimated by considering how of- 
ten pairs of points are associated with 
the same fibre across a number of sam- 
ples. 



Fig 6. Pores from portion of fingerprint a002-05 from the NIST Special Database 30 
Watson (2001). 
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Posterior Probabilities for Number of Fibres 



Number of Fibres 



13 



14 



15 



16 



17 



18 



19 



20 



Posterior Probability 



0.03 



0.11 



0.17 0.17 



0.25 



0.11 



0.09 



0.05 



Other Properties Conditioned on the Number of Fibres 





Number of 


Posterior 


OU /o n-r\J 


yo/o rlr^JJ 




Fibres h 


iviean 


Interval 


Interval 






14. 40 


[14,16] 


[9,18] 




15 


15.71 


[12,15] 


[12,21] 


Number of Noise Points 


16 


15.00 


[12,17] 


[8,21] 




17 


18.86 


[16,19] 


[16,23] 




18 


16.08 


[15,18] 


[10,25] 




14 


3.76 


[3.74,3.82] 


[3.25,4.22] 


95tli Percentile of the 


15 


3.68 


[3.52,3.69] 


[3.32,4.77] 


Distances from Signal 


16 


3.56 


[2.34,3.52] 


[3.34,3.89] 


Points to Fibres 


17 


3.58 


[3.38,3.64] 


[3.23,3.95] 




18 


3.67 


[3.44,3.75] 


[3.24,4.38] 




14 


862.18 


[861,883] 


[785,964] 




15 


882.38 


[872,933] 


[818,966] 


Total Length of Fibres 


16 


864.75 


[836,886] 


[784,927] 




17 


814.43 


[788,804] 


[788,878] 




18 


861.76 


[821,876] 


[761,941] 



Table 3 

Fingerprint Pore Data Set: Posterior means and 50% and 95% credible intervals of a 
selection of properties of the posterior distribution conditional on the number of fibres. 
The data was extracted from a portion of fingerprint a002-05 from the NIST (National 
Institute of Standards and Technology) Special Database 30 (Watson, 2001). It consists 
of 123 points on a 100 x 100 window. A dispersion parameter of adisp = 1-5 is used, and 
the mean prior probability a point is noise is 0.091. Posterior probabilities only given if 

non-zero to rounding error. 



fingerprint ridges) to the pore data. Figure 6(c) indicates areas of doubt in 
tlie fibre locations where the shading is lighter near the edges of the window, 
showing that fibre samples were more dispersed. 

This data set is an ideal candidate for reconstruction of missing data. We 
work under the assumptions that pores lie at fairly regularly intervals along 
ridges, but some are not identified during the pore extraction process. Our 
method uses information from nearby ridges to complete fibres where data 
is missing. In this example this is particularly evident in the region below 
the centre of the window. Knowledge of the posterior distribution of fibres 
could lead to a 'filling in the gaps' approach to reconstructing the missing 
pore data. 

Table 3 gives some numerical properties of the posterior distribution of fi- 
bres. 
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7. Discussion. In this paper we have identified a new model for fibre 
processes and for point processes generated from a fibre process. We have 
shown how Monte Carlo methods can be used to sample from the posterior 
distribution of a fibre process that is instrumental in generating a point 
process. 

Many different data sets of this type arise in nature. We investigated earth- 
quakes that cluster around fault-lines and pores in fingerprints that are 
situated along the fingerprint ridges. Other data can be found in catalogues 
of galaxies in the visible universe. Galaxies are known to align themselves 
along 'cosmic' filaments which, in turn, connect to form a web-like structure. 
Understanding these fibres and identifying where they lie is of great interest 
to cosmologists, see for example Martinez and Saar (2002) for a statistical 
overview of some current ideas and also Stoica, Martinez and Saar (2007) for 
a different approach to modelling the filament structure. Other data sets for 
which this model may be suitable include the locations of land mines, often 
placed in straight lines. Identifying these lines may aid in the discovery of 
currently undetected mines. Similar methods of detecting structure in noisy 
pictures are a prominent area of research in image recognition. 

This process can be used to fit non-parametric curves to point patterns with 
just two limitations on the nature of the curves. The limitations are that 
the curves must not intersect, and that they must be 'sufficiently' smooth 
(i.e. there must be no acute angles in the discretization of the fibres). The 
smoothness property is desirable to identify smooth curves rather than com- 
plex structures. The non- intersection property may be less desirable but, at 
some computational cost, the model could be generalized to allow each fibre 
to integrate a different field of orientations. 

We do not make use of a deterministic algorithm (such as the EM-algorithm) 
to fit the fibres, and our approach is not highly sensitive to the choice of 
starting parameters. Therefore it can be used to provide interval estimates 
for various parameters. One of the most sensitive parameters fixed in the 
algorithm is cr^j^p which governs the deviation of points from the fibres. If 
chosen too large the result will be too few fibres with a sizeable error in 
their locations. If chosen too small, fibre clusters may be split into multiple 
parallel smaller clusters. Our experience is that the algorithm is reasonably 
robust to changes in other parameters. 

One strength of our model is that it fits the noise-signal and cluster allo- 
cations implicitly, in contrast to other cases where the clustering may need 
to be predetermined. The advantage is that we can produce reliability esti- 
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mates for these clustering and noise allocations and explore more potential 
clustering configurations, and hence more fibre structures. 

A limitation of our model arises from the constraints on the similarity of 
fibres. Fibres are assumed to be of the same width (the displacement of 
points from the fibres is independent of the fibre), and have the same mean 
points per unit fibre length. These are not always reasonable assumptions, 
as is evidenced by the earthquake data set. We could extend the model to 
allow parameters c^j^p and t] to take different values for each fibre in order 
to eliminate this issue. A further extension would be to include isotropic 
clusters of points which do not fit well to the 'fibre' model. 

The complexity of the model, considering the infinite dimensionality of the 
field of orientations, raises the question of whether or not the Markov Chain 
adequately explores the sample space. Our examples indicate that, whilst 
the sample space of fields of orientations is not explored particularly well, 
the space of fibre configurations is well explored and the field of orientations 
varies enough to explore a wide space of fibre configurations. However as the 
density of fibres increases so the MCMC algorithm requires longer runtime 
in order to overcome these issues. 

Note that while our model performs as well as other available techniques 
on the basic data sets, it demonstrates significantly better performance on 
the fingerprint data where a large number of dense fibre clusters account for 
most of the data. 

It is necessary to bear in mind the ramifications of edge effects in the model 
and subsequently the MCMC algorithm. As we are sampling from a bounded 

subset W C M2 ti^g 

omission of potential points just outside W induces a 
bias on distance-related measures. The field of orientations will have a bias 
at the edge favouring orientations parallel to the sides of a rectangular win- 
dow W. Fibres are created by sampling a random reference point from the 
field and integrating the field of orientations from that point. However the 
reference point cannot be sampled from outside W, and fibres that extend 
past the boundary of W are typically terminated on the border as no field of 
orientations is available past that point. Also, the model for the displacement 
of points from fibres does not account for edge effects. Most of these algo- 
rithmic biases would be significantly decreased by creating a border around 
W and completing the analysis over the whole area. However this would 
come at an additional computational cost. 

We have commented in passing on the phenomenon of curvature bias and 
its effects on the estimation of parameters, and we note this as a fruitful 
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area for future research. Further research possibihties include the fitting of 
2 dimensional surfaces in 3 dimensions. Then new geometric issues need to 
be taken into account; for example it is not the case that a generic field of 
tangent planes can be developed into a fibration by surfaces. It is hoped to 
investigate this problem in further work. 
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